This notebook provides one-click, run-all code for data loading, processing, EDA, clustering, and figure plotting.
The project folder should be structured as follows:
├── (Your name for the project folder)
├── README.md
├── data
│   ├── Introducing_the_Enron_Corpus.pdf
│   ├── enron_mail_20150507
│   │   └── maildir
│   └── klimt-ecml04.pdf
├── figures
│   ├── email_all_cluster.png
│   ├── email_all_network.png
│   ├── email_inbox_cluster.png
│   └── email_inbox_network.png
├── final_project.Rmd
├── final_project.html
├── notebooks
│   └── experiments.Rmd
├── results
│   └── community_assignments.csv
└── scripts
    ├── data_loading.R
    ├── download_enron.R
    ├── starting_code.R
    └── trial_igraph.R
Make sure the folders are structured exactly as above; the paths used below depend on this layout, so the notebook is only reproducible if it is followed.
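As a quick sanity check before running anything, the layout can be verified programmatically. The following is only a minimal sketch: `check_project_layout()` is a hypothetical helper (it is not part of the project's scripts), and the list of required folders is simply read off the tree above.

```r
# Hypothetical helper: report which expected top-level folders are missing.
check_project_layout <- function(project_root,
                                 required = c("data", "figures", "notebooks",
                                              "results", "scripts")) {
  present <- dir.exists(file.path(project_root, required))
  missing <- required[!present]
  if (length(missing) > 0) {
    warning("Missing project folders: ", paste(missing, collapse = ", "))
  }
  invisible(missing)
}

# Demo on a throwaway directory that only contains data/ and scripts/.
demo_root <- file.path(tempdir(), "layout_demo")
dir.create(file.path(demo_root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "scripts"), showWarnings = FALSE)
missing_dirs <- suppressWarnings(check_project_layout(demo_root))
print(missing_dirs)
```

In the notebook itself, the same helper could be called on the project root once the paths have been set.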
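# load the packages used throughout the notebook
library(dplyr)     # %>%, group_by(), filter(), mutate()
library(ggplot2)   # histograms and heatmaps
library(reshape2)  # melt()
library(igraph)    # graph construction and Louvain clustering
# set the paths (requires RStudio; the project root is taken to be
# the parent of the folder that contains this notebook)
notebook_path <- rstudioapi::getActiveDocumentContext()$path
mother_path <- dirname(dirname(notebook_path))
# set the path for the project, data, scripts, etc.
Sys.setenv(mother_path = mother_path)
script_path <- paste0(mother_path, "/scripts")
data_path <- paste0(mother_path, "/data")
results_path <- paste0(mother_path, "/results")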
Next, download the data (if needed) and load everything required for
the analysis. To load the data yourself or to apply different filters,
see /scripts/data_loading.R.
# detect whether the data is already downloaded; if not, download and untar it.
if (!file.exists(paste0(data_path, "/enron_mail_20150507"))) {
  message("No dataset detected, start downloading, please wait patiently.")
  source(paste0(script_path, "/download_enron.R"))
  download_data()
}
# load all the data we need for analyses
load(paste0(results_path, "/dfs.Rdata"))
Before looking at the data, we clarify two concepts that are easy to confuse:
name: the folder name of a user, for example,
allen-p or causholli-m.
mailname: the name(s) that a user uses in their
email address (excluding the domain name “@enron.com”). For example,
allen-p uses both “phillip.allen” and “k..allen” as
mailnames.
With these two concepts, we briefly describe the dataframes and how they are obtained:
users: the dataframe containing every active
user’s name, mailname and the absolute path to
their folder.
inboxes.within.fromto.df: the dataframe containing the
from and to names of the emails extracted from all users’
inboxes.
all.within.fromto.df: the dataframe containing the
from and to names of the emails extracted from all of the
users’ email folders.
The data loading and processing procedure for
all.within.fromto.df is as follows:
1. List the paths of all mail files and collect them in a dataframe.
2. Read the From line of each email and keep only the emails sent
by users within the company. Use the folder owner’s name as To, and
form a dataframe containing all the From and To information.
3. Create the users dataframe by extracting all the
mailnames the users use and matching them with their
name.
4. Keep only the emails whose From and To users are both within the
company (i.e. included in users).
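The matching and filtering steps above can be sketched on toy data. This is an illustration under stated assumptions, not the code in /scripts/data_loading.R: `extract_mailname()` is a hypothetical helper, the mailname given for causholli-m is made up for the example, and the three messages are invented.

```r
# Hypothetical lookup table: one row per (name, mailname) pair.
# "monika.causholli" is an invented mailname used only for illustration.
users <- data.frame(
  name = c("allen-p", "allen-p", "causholli-m"),
  mailname = c("phillip.allen", "k..allen", "monika.causholli"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pull the mailname out of a raw "From:" header line.
# Senders outside the enron.com domain yield NA.
extract_mailname <- function(from_line) {
  address <- trimws(sub("^From:\\s*", "", from_line))
  ifelse(grepl("@enron\\.com$", address, ignore.case = TRUE),
         sub("@enron\\.com$", "", address, ignore.case = TRUE),
         NA_character_)
}

# Toy messages: the From line plus the folder owner, which is used as "to".
mails <- data.frame(
  from_line = c("From: phillip.allen@enron.com",
                "From: k..allen@enron.com",
                "From: someone@example.com"),
  to = c("causholli-m", "allen-p", "allen-p"),
  stringsAsFactors = FALSE
)

# Map each mailname back to the user's folder name.
mails$mailname <- extract_mailname(mails$from_line)
mails$from <- users$name[match(mails$mailname, users$mailname)]

# Keep only emails whose sender and recipient are both known users.
within_df <- mails[!is.na(mails$from) & mails$to %in% users$name,
                   c("from", "to")]
print(within_df)
```

Both mailnames of allen-p resolve to the same name, which is why the lookup table keeps one row per mailname rather than one row per user.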
REMARK: The “/sent_items/” folder of pereira-s contains a sub-folder named “clickathome”. It holds a single email, which turned out to be an advertisement, so we exclude it from the analysis.
We conduct EDA on both the inbox mail data and the all-mail data. For both datasets, we plot some exploratory figures to look for clear patterns or interesting trends.
First, we plot histograms of the number of emails each user sent and received.
For readability, we also plot a version restricted to users with at least 50 emails.
# make plots for filtered data: keep senders with at least 50 emails,
# ordered by how many emails they sent
filtered_sent <- inboxes.within.fromto.df %>%
  group_by(from) %>%
  filter(n() >= 50) %>%
  mutate(num_from = n()) %>%
  ungroup() %>%
  mutate(from = factor(from, levels = names(sort(table(from), decreasing = TRUE))))
hist_sent <- ggplot(filtered_sent, aes(x = from)) +
  geom_bar() +
  labs(title = "histogram of inbox emails sent within company (>=50)", x = 'Sent by:') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# same for recipients with at least 50 emails
filtered_receive <- inboxes.within.fromto.df %>%
  group_by(to) %>%
  filter(n() >= 50) %>%
  mutate(num_to = n()) %>%
  ungroup() %>%
  mutate(to = factor(to, levels = names(sort(table(to), decreasing = TRUE))))
hist_received <- ggplot(filtered_receive, aes(x = to)) +
  geom_bar() +
  labs(title = "histogram of inbox emails received within company (>=50)", x = 'Sent to:') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot(hist_received)
plot(hist_sent)
Interestingly, grigsby-m, who sent the most emails to other users’ inboxes, does not appear at all in the filtered figure of inbox emails received. According to an outside source, grigsby-m held the title VP Trading, ENA Gas West, which would explain the large number of emails he sent. We now cross-check the users who appear in both filtered sets, i.e. those who both sent and received at least 50 inbox emails:
common_users <- intersect(filtered_sent$from, filtered_receive$to)
print(common_users)
## [1] "watson-k" "tycholiz-b" "dasovich-j" "whitt-m" "nemec-g"
## [6] "shackleton-s" "heard-m"
Among them, tycholiz-b is VP Trading, ENA Gas West; shackleton-s is VP ENA & Senior Counsel; dasovich-j is Dir State Government Affairs; and heard-m is Specialist Legal.
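To see at a glance why a user can be a top sender yet be absent from the receiver plot, it can help to put both counts side by side. The sketch below does this in base R on invented data; the column names `n_sent` and `n_received` and the threshold of 2 are assumptions made for the toy example (the notebook uses 50).

```r
# Toy from/to pairs standing in for inboxes.within.fromto.df.
edges <- data.frame(
  from = c("a", "a", "a", "b", "c"),
  to   = c("b", "c", "b", "a", "a"),
  stringsAsFactors = FALSE
)

# Count sent and received emails over one shared set of users,
# so users with zero emails on one side still get a row.
users_seen <- sort(unique(c(edges$from, edges$to)))
activity <- data.frame(
  user = users_seen,
  n_sent = as.integer(table(factor(edges$from, levels = users_seen))),
  n_received = as.integer(table(factor(edges$to, levels = users_seen))),
  stringsAsFactors = FALSE
)

# Users at or above the (toy) threshold on both sides.
threshold <- 2
busy_both <- activity$user[activity$n_sent >= threshold &
                           activity$n_received >= threshold]
print(activity)
print(busy_both)
```

Using a shared set of factor levels mirrors how the contingency table is built later in the notebook.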
We also look at the email clusters and the social network plots:
set.seed(321)
# Get the unique senders and recipients
all_names <- unique(c(inboxes.within.fromto.df$from, inboxes.within.fromto.df$to))
# Create a contingency table with a predefined set of row and column names
mail_count_table <- table(factor(inboxes.within.fromto.df$from, levels = all_names), factor(inboxes.within.fromto.df$to, levels = all_names))
# print(table_inboxes_from_to_within)
mail_count_df <- melt(mail_count_table)
# Create a heatmap using ggplot2
# log1p() keeps zero counts finite (log(0) is -Inf); the fill shows log(1 + count)
ggplot(mail_count_df, aes(x = Var1, y = Var2, fill = log1p(value))) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Email Interaction Heatmap", x = "From", y = "To", fill = "log(1 + count)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Create a graph from the matrix
graph <- graph_from_adjacency_matrix(mail_count_table, mode = "undirected", weighted = TRUE, diag = FALSE)
# Filter edges with low weight (e.g., below a threshold)
# graph <- delete_edges(graph, E(graph)[weight < 5])
community <- cluster_louvain(graph)
png(paste0(mother_path, "/figures/email_inbox_cluster.png"), width=3200, height = 3200)
plot(community, graph, vertex.size=4)
dev.off()
## quartz_off_screen
## 2
layout <- layout_with_fr(graph) # Fruchterman-Reingold layout (often better for clarity)
png(paste0(mother_path, "/figures/email_inbox_network.png"), width = 3200, height = 3200) # Width and height in pixels
# Plot the network graph with adjustments
plot(graph,
     vertex.size = 5,                # Larger nodes
     vertex.label.cex = 0.8,         # Adjust text size
     edge.width = E(graph)$weight,   # Edge width based on the weight (email count)
     layout = layout,                # Use the new layout for better node spacing
     main = "Email Interaction Network",
     vertex.label.color = "black",   # Change label color for contrast
     vertex.color = "lightblue",     # Node color
     edge.arrow.size = 0.5,          # Adjust arrow size on edges
     edge.color = "gray",            # Edge color
     vertex.label.dist = 1,          # Distance between label and node
     vertex.frame.color = "white")   # Frame color around nodes
# Close the PNG device (save the plot)
dev.off()
## quartz_off_screen
## 2
In the inbox cluster/network figure, causholli-m appears to form a cluster entirely on her own. Let’s check what happened to causholli-m:
causholli_inbox <- inboxes.within.fromto.df %>%
filter(to == 'causholli-m')
causholli_inbox
## from to
## 1 causholli-m causholli-m
Her inbox contains just one within-company email, which she sent to herself, and that is why she forms an isolated group. This also shows that considering only the inbox emails is far from enough.
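The effect can be reproduced on a toy graph. This is a minimal sketch with invented users and counts: because the graph is built with `diag = FALSE`, an email a user sends to herself contributes no edge, so a user whose only email is a self-email ends up as an isolated vertex and is placed in a community of her own.

```r
library(igraph)

# Toy count matrix: rows are senders, columns are recipients.
toy_users <- c("a", "b", "c")
counts <- matrix(0, nrow = 3, ncol = 3,
                 dimnames = list(toy_users, toy_users))
counts["a", "b"] <- 3
counts["b", "a"] <- 1
counts["c", "c"] <- 1   # "c" only ever emailed herself

# Symmetrize so that each pair carries the total traffic in both directions.
sym_counts <- counts + t(counts)

# diag = FALSE drops self-emails, exactly as in the notebook's graph.
toy_graph <- graph_from_adjacency_matrix(sym_counts, mode = "undirected",
                                         weighted = TRUE, diag = FALSE)

set.seed(321)
toy_comm <- cluster_louvain(toy_graph)

print(degree(toy_graph))      # "c" has no edges
print(membership(toy_comm))   # "c" sits in its own community
```

Symmetrizing by hand is a choice made for this sketch; it sums the counts in both directions before the undirected graph is built.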
From now on, we focus on all the emails exchanged within Enron.
We make the same histograms to view the top senders and receivers:
The results are more convincing once we consider all emails exchanged among the users. The top user in both From and To is Kay Mann (mann-k), who was the head of legal for Enron. That she sent so many emails is ironic, given how Enron was breaking every law in the book. Users who newly appear among the busiest include germany-c (Capacity Trader), jones-t (Senior Legal Specialist), scott-s (Assistant Trader), and sager-e (VP & Assistant General Counsel).
common_users <- intersect(filtered_sent$from, filtered_receive$to)
print(common_users)
## [1] "arnold-j" "beck-s" "dasovich-j" "shackleton-s"
## [5] "mann-k" "jones-t" "bass-e" "lenhart-m"
## [9] "scott-s" "fossum-d" "symes-k" "germany-c"
## [13] "nemec-g" "perlingiere-d" "sager-e" "rodrique-r"
## [17] "stclair-c"
We also plot the clusters and network using all the email sending/receiving information.
set.seed(123)
# Get the unique senders and recipients
all_names <- unique(c(all.within.fromto.df$from, all.within.fromto.df$to))
# Create a contingency table with a predefined set of row and column names
mail_count_table <- table(factor(all.within.fromto.df$from, levels = all_names), factor(all.within.fromto.df$to, levels = all_names))
# print(table_inboxes_from_to_within)
mail_count_df <- melt(mail_count_table)
# Create a heatmap using ggplot2
# log1p() keeps zero counts finite (log(0) is -Inf); the fill shows log(1 + count)
ggplot(mail_count_df, aes(x = Var1, y = Var2, fill = log1p(value))) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Email Interaction Heatmap", x = "From", y = "To", fill = "log(1 + count)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Create a graph from the matrix
graph <- graph_from_adjacency_matrix(mail_count_table, mode = "undirected", weighted = TRUE, diag = FALSE)
# Filter edges with low weight (e.g., below a threshold)
# graph <- delete_edges(graph, E(graph)[weight < 5])
community <- cluster_louvain(graph)
png(paste0(mother_path, "/figures/email_all_cluster.png"), width=3200, height = 3200)
plot(community, graph, vertex.size=4)
dev.off()
## quartz_off_screen
## 2
layout <- layout_with_fr(graph) # Fruchterman-Reingold layout (often better for clarity)
png(paste0(mother_path, "/figures/email_all_network.png"), width = 3200, height = 3200) # Width and height in pixels
# Plot the network graph with adjustments
plot(graph,
     vertex.size = 5,                # Larger nodes
     vertex.label.cex = 0.8,         # Adjust text size
     edge.width = E(graph)$weight,   # Edge width based on the weight (email count)
     layout = layout,                # Use the new layout for better node spacing
     main = "Email Interaction Network",
     vertex.label.color = "black",   # Change label color for contrast
     vertex.color = "lightblue",     # Node color
     edge.arrow.size = 0.5,          # Adjust arrow size on edges
     edge.color = "gray",            # Edge color
     vertex.label.dist = 1,          # Distance between label and node
     vertex.frame.color = "white")   # Frame color around nodes
# Close the PNG device (save the plot)
dev.off()
## quartz_off_screen
## 2
From the email_all_cluster figure, we pick the two best-performing clusters (judged by eye), extract their members, and export them as cluster1 and cluster2. cluster1 and cluster2 then hold the names, email file directories, and mailnames of the users in each cluster.
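Extracting the members of a chosen cluster can be sketched as follows. This is an illustration on a toy graph, not the notebook's actual export code: the graph, the way the cluster is chosen (the community containing vertex "a"), and the output file name are all assumptions made for the example.

```r
library(igraph)

# Toy graph: two triangles joined by a single edge.
set.seed(1)
toy_g <- graph_from_literal(a - b, b - c, c - a, d - e, e - f, f - d, c - d)
toy_louvain <- cluster_louvain(toy_g)

# One row per user with the id of the community it was assigned to.
assignments <- data.frame(
  name = V(toy_g)$name,
  community = as.integer(membership(toy_louvain)),
  stringsAsFactors = FALSE
)
print(sizes(toy_louvain))

# Pick a cluster by id (here: whichever community contains "a")
# and collect the names of its members.
chosen_id <- assignments$community[assignments$name == "a"]
cluster1 <- assignments$name[assignments$community == chosen_id]
print(cluster1)

# Write the full assignment table to a temporary file (hypothetical file name).
out_file <- file.path(tempdir(), "community_assignments_demo.csv")
write.csv(assignments, out_file, row.names = FALSE)
```

With the real data, the selected names could then be matched against the users dataframe to recover each member's folder path and mailnames.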